Instructions

Welcome to the visual presentation of the Cyclistic bike-share data analysis project. It presents the final steps and analyses as a chronologically ordered series of stages, from the beginning of the study to the final results.

I am a student in the Data Analytics course offered by Coursera in collaboration with Google, and this case study is the capstone project at the end of the program. In it, I present the scenario and introduction to the project, then document all the work from beginning to end through structured thinking, a plan, and a roadmap. I aim to present this project in a professional, organized manner, using the methodology I learned throughout the course.

As a junior data analyst, I will derive the insights that serve as the foundation for the marketing team to design a new, targeted strategy. The ultimate goal is to maximize the number of annual memberships by understanding the differences in usage patterns between the two customer types. This understanding will enable marketing campaigns aimed specifically at converting casual riders into more profitable annual members.

With this information, my team's goal is to design a new marketing strategy to convert casual riders into annual members. To do so, we need to answer three questions:

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

Let’s begin with the phases of the data analysis methodology: Ask, Prepare, Process, Analyze, Share, and Act.

Case Study Roadmap - Ask

● What is the problem you are trying to solve?
● How can your insights drive business decisions?

Clear Statement of the Business Task

The primary business task is to analyze the historical bike trip data to determine how annual members and casual riders use Cyclistic bikes differently.

● Identify the business task
● Consider key stakeholders
● Produce a clear statement of the business task
● Identify the specific data needed (e.g., historical trip data)
● Determine data credibility and integrity

Tools used

Spreadsheet: Microsoft Excel
Data analysis: SQL (Google BigQuery) and the R programming language (RStudio)
Visualization: Tableau Public and the R programming language (RStudio)

Prepare

We will start by using Cyclistic’s historical trip data to identify trends. We will work in RStudio Desktop to avoid storage and file-size problems. We downloaded the bike trip data for the previous 12 months (2022/05 to 2023/04) from kaggle.com. (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license.) This is public data that you can use to explore how different customer types are using Cyclistic bikes, but note that data-privacy issues prohibit you from using riders’ personally identifiable information. This means that you won’t be able to connect pass purchases to credit card numbers to determine whether casual riders live in the Cyclistic service area or whether they have purchased multiple single passes.
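Before loading anything, the 12 expected monthly file names can be generated programmatically and compared against the raw-data folder to flag a missing month early. A minimal sketch — the `YYYYMM-divvy-tripdata.csv` pattern mirrors the files loaded later in this report, though one September file deviates from it, so this is a starting point rather than a guarantee:

```r
# Build the expected file names for the 12 months from 2022-05 through 2023-04.
months <- seq(as.Date("2022-05-01"), as.Date("2023-04-01"), by = "month")
expected_files <- paste0(format(months, "%Y%m"), "-divvy-tripdata.csv")

length(expected_files)  # 12
head(expected_files, 2) # "202205-divvy-tripdata.csv" "202206-divvy-tripdata.csv"
```

Comparing this vector against `list.files()` on the raw-data folder with `setdiff()` quickly shows which months, if any, are absent.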

This usability score is calculated by Kaggle for Version 1 of the dataset (1.13 GB):

Usability

Completeness 100%
✔ Subtitle ✔ Tag ✔ Description ✔ Cover Image

Credibility 67%
✔ Source/Provenance ✔ Public Notebook ✘ Update Frequency ✔ Column Description

Compatibility 100%
✔ License ✔ File Format ✔ File Description

First stage: Load and Merge

This report presents the work using the following tools:

Data analysis: R programming language (RStudio)
Visualization: Tableau Public and R programming language (RStudio)

R Programming

Load the required libraries

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(glue)
library(htmltools)
library(janitor)   # clean_names()
library(lubridate) # date-time helpers (also attached by tidyverse >= 2.0)

Determine the paths

main_dir <- here::here("C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy")
data_raw_path <- file.path(main_dir, "data", "data_row")
data_cleaned_path <- file.path(main_dir, "data", "cleaned_data")

Create a cleaned_data folder if it doesn’t already exist

if (!dir.exists(data_cleaned_path)) {
  dir.create(data_cleaned_path, recursive = TRUE)
}

Fetch CSV file list

files <- list.files(path = data_raw_path, pattern = "\\.csv$", full.names = TRUE)

cat("Found", length(files), "data files.\n")
## Found 12 data files.

Read and merge all files

all_trips <- files %>%
  map_dfr(function(file) {
    cat("Loading:", basename(file), "\n")
    read_csv(file, show_col_types = FALSE)
  })
## Loading: 202205-divvy-tripdata.csv 
## Loading: 202206-divvy-tripdata.csv 
## Loading: 202207-divvy-tripdata.csv 
## Loading: 202208-divvy-tripdata.csv 
## Loading: 202209-divvy-publictripdata.csv 
## Loading: 202210-divvy-tripdata.csv 
## Loading: 202211-divvy-tripdata.csv 
## Loading: 202212-divvy-tripdata.csv 
## Loading: 202301-divvy-tripdata.csv 
## Loading: 202302-divvy-tripdata.csv 
## Loading: 202303-divvy-tripdata.csv 
## Loading: 202304_divvy_tripdata.csv
cat("✅ All files merged successfully.\n")
## ✅ All files merged successfully.

Preliminary cleaning of columns

all_trips <- clean_names(all_trips)

# Show first 5 rows to check
print(head(all_trips, 5))
## # A tibble: 5 × 13
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 EC2DE40644C6B0F4 classic_bike  2022-05-23 23:06:58 2022-05-23 23:40:19
## 2 1C31AD03897EE385 classic_bike  2022-05-11 08:53:28 2022-05-11 09:31:22
## 3 1542FBEC830415CF classic_bike  2022-05-26 18:36:28 2022-05-26 18:58:18
## 4 6FF59852924528F8 classic_bike  2022-05-10 07:30:07 2022-05-10 07:38:49
## 5 483C52CAAE12E3AC classic_bike  2022-05-10 17:31:56 2022-05-10 17:36:57
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
# Displaying available columns
cat("📊 Available columns:\n")
## 📊 Available columns:
print(names(all_trips))
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Convert the time columns if available

if (all(c("started_at", "ended_at") %in% names(all_trips))) {
  all_trips <- all_trips %>%
    mutate(
      started_at = ymd_hms(started_at, quiet = TRUE),
      ended_at   = ymd_hms(ended_at, quiet = TRUE)
    )
  cat("🕒 The time columns have been converted to datetime format.\n")
} else {
cat("⚠️ The time columns (started_at / ended_at) do not exist or have different names.\n")
}
## 🕒 The time columns have been converted to datetime format.

Save the merged file

output_file <- file.path(data_cleaned_path, "all_trips_raw.csv")
write_csv(all_trips, output_file)
cat("\n💾 The merged file was successfully saved to:\n", output_file, "\n")
## 
## 💾 The merged file was successfully saved to:
##  C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_raw.csv
# 8. Quick overview of size and structure

cat("\n📈 Number of rows:", nrow(all_trips), "\n")
## 
## 📈 Number of rows: 5859061
cat("📊 Number of columns:", ncol(all_trips), "\n")
## 📊 Number of columns: 13

02 Cleaning

Required libraries

——————————-

If any of these are not installed, install them first:

install.packages(c("tidyverse", "lubridate", "janitor", "skimr", "readr", "here"))

library(tidyverse)
library(lubridate)
library(janitor)
library(skimr)
library(readr)   
library(here)

1 Paths (portable via here::here())

main_dir <- here::here("C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy")
raw_merged_file <- file.path(main_dir, "data", "cleaned_data", "all_trips_raw.csv")
cleaned_dir <- file.path(main_dir, "data", "cleaned_data")
outputs_dir <- file.path(main_dir, "data", "outputs")

if (!dir.exists(cleaned_dir)) dir.create(cleaned_dir, recursive = TRUE)
if (!dir.exists(outputs_dir)) dir.create(outputs_dir, recursive = TRUE)

cat("📂 Read the merged file from:\n", raw_merged_file, "\n")
## 📂 Read the merged file from:
##  C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_raw.csv

2 Read the file (safely) + quick scan

We use read_csv() with show_col_types = FALSE to keep the console output quiet, since the file is large.

all_trips <- read_csv(raw_merged_file, show_col_types = FALSE)
cat("✅ File read. Data dimensions: ", dim(all_trips)[1], "rows x", dim(all_trips)[2], "columns\n\n")
## ✅ File read. Data dimensions:  5859061 rows x 13 columns
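Because `read_csv()` re-guesses column types from a sample on every read, pinning the types explicitly makes large reads faster and more reproducible. A minimal sketch with a hypothetical two-column spec (a full spec would mirror the 13 columns in the skim output below):

```r
library(readr)

# Pin the types instead of letting read_csv() guess them from a sample.
trip_cols <- cols(
  ride_id    = col_character(),
  started_at = col_datetime(format = ""), # "" = flexible ISO-8601 parsing
  .default   = col_character()
)

# Demo on an in-memory CSV; I() marks a literal string as data (readr >= 2.0).
demo <- read_csv(I("ride_id,started_at\nA1,2022-05-01 00:00:06\n"),
                 col_types = trip_cols)
```

The same `col_types = trip_cols` argument would then be passed to the `read_csv(raw_merged_file, ...)` call above.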

Quick look at columns and empty values

cat("🔎 Quick look at columns:\n")
## 🔎 Quick look at columns:
print(names(all_trips))
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
cat("\n")
cat("📋 Quick summary of some variables (skim):\n")
## 📋 Quick summary of some variables (skim):
print(skim(all_trips))
## ── Data Summary ────────────────────────
##                            Values   
## Name                       all_trips
## Number of rows             5859061  
## Number of columns          13       
## _______________________             
## Column type frequency:              
##   character                7        
##   numeric                  4        
##   POSIXct                  2        
## ________________________            
## Group variables            None     
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable      n_missing complete_rate min max empty n_unique whitespace
## 1 ride_id                    0         1      16  16     0  5859061          0
## 2 rideable_type              0         1      11  13     0        3          0
## 3 start_station_name    832009         0.858   3  64     0     1722          0
## 4 start_station_id      832141         0.858   3  36     0     1319          0
## 5 end_station_name      889661         0.848   3  64     0     1741          0
## 6 end_station_id        889802         0.848   3  36     0     1324          0
## 7 member_casual              0         1       6   6     0        2          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean     sd    p0   p25   p50   p75
## 1 start_lat             0         1      41.9 0.0458  41.6  41.9  41.9  41.9
## 2 start_lng             0         1     -87.6 0.0287 -87.8 -87.7 -87.6 -87.6
## 3 end_lat            5973         0.999  41.9 0.0672   0    41.9  41.9  41.9
## 4 end_lng            5973         0.999 -87.6 0.106  -88.1 -87.7 -87.6 -87.6
##    p100 hist 
## 1  42.1 ▁▁▇▇▁
## 2 -87.5 ▁▁▆▇▁
## 3  42.4 ▁▁▁▁▇
## 4   0   ▇▁▁▁▁
## 
## ── Variable type: POSIXct ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min                 max                
## 1 started_at           31         1.000 2022-05-01 00:00:06 2023-04-30 23:59:05
## 2 ended_at             23         1.000 2022-05-01 00:05:17 2023-05-03 10:37:12
##   median              n_unique
## 1 2022-08-28 12:44:56  4916295
## 2 2022-08-28 13:07:18  4930151

——————————-

3 Unify column names (if they are not already unified)

——————————-

all_trips <- all_trips %>% clean_names() # janitor::clean_names()

————————————————————–

4 Correcting the names of the basic categories (member_casual)

————————————————————–

Unifying the values of member_casual to "member" and "casual" (standardized to lowercase)

if ("member_casual" %in% names(all_trips)) {
  all_trips <- all_trips %>%
    mutate(member_casual = tolower(member_casual) %>% str_trim()) %>%
    mutate(member_casual = case_when(
      member_casual %in% c("member", "registered", "subscriber") ~ "member",
      member_casual %in% c("casual", "customer", "guest") ~ "casual",
      TRUE ~ member_casual
    ))
} else {
  stop("⚠️ The column 'member_casual' does not exist in the data — check the column names.")
}

——————————-

5 Robustly Convert Time Columns

——————————-

Some files may contain different time formats, so we try several parsers in turn:

parse_safe_datetime <- function(x) {
  res <- suppressWarnings(ymd_hms(x, quiet = TRUE))
  if (all(is.na(res))) res <- suppressWarnings(ymd_hm(x, quiet = TRUE))
  if (all(is.na(res))) res <- suppressWarnings(as_datetime(x))
  return(res)
}

if (all(c("started_at", "ended_at") %in% names(all_trips))) {
  all_trips <- all_trips %>%
    mutate(
      start_at = parse_safe_datetime(started_at),
      end_at   = parse_safe_datetime(ended_at)
    )
  cat("🕒 started_at and ended_at have been converted to datetime (multiple attempts).\n")
} else {
  stop("⚠️ The started_at or ended_at columns do not exist.")
}
## 🕒 started_at and ended_at have been converted to datetime (multiple attempts).

——————————-

6. Duplicate Handling and ride_id

——————————-

If a ride_id exists, we use it to remove exact duplicates

if ("ride_id" %in% names(all_trips)) {
  before_dup <- nrow(all_trips)
  all_trips <- all_trips %>% distinct(ride_id, .keep_all = TRUE)
  after_dup <- nrow(all_trips)
  cat("🧹 Duplicates removed according to ride_id:", before_dup - after_dup, "deleted.\n")
} else {
  cat("⚠️ No ride_id; we will attempt to remove exact duplicates if any exist.\n")
  before_dup <- nrow(all_trips)
  all_trips <- all_trips %>% distinct()
  after_dup <- nrow(all_trips)
  cat("🧹 Deleted", before_dup - after_dup, "exact duplicates.\n")
}
## 🧹 Duplicates removed according to ride_id: 0 deleted.

—————————————————————————

7 Calculating Trip Duration (Seconds and Minutes) and Checking for Outliers

—————————————————————————

all_trips <- all_trips %>%
  mutate(
    ride_length_sec = as.numeric(difftime(ended_at, started_at, units = "secs")),
    ride_length_min = ride_length_sec / 60
  )

Remove trips with negative or zero durations, or durations longer than a reasonable limit (e.g., 24 hours = 1,440 minutes)

You can adjust the maximum duration as you see fit (e.g., 6 hours = 360 minutes)

min_valid_min <- 0.1  # less than ~6 seconds is treated as zero/invalid
max_valid_min <- 1440 # 24 hours is the reasonable maximum

before_filter <- nrow(all_trips)
all_trips <- all_trips %>%
  filter(!is.na(ride_length_min)) %>%
  filter(ride_length_min > min_valid_min & ride_length_min <= max_valid_min)
after_filter <- nrow(all_trips)
cat("🚫 Invalid trips removed (negative/zero/longer than limit):", before_filter - after_filter, "\n")
## 🚫 Invalid trips removed (negative/zero/longer than limit): 26498

————————————————————

8. Handling Missing Values in Stations (A Practical Strategy)

—————————————————————

If station names are missing (often dockless trips), the rows can be retained and tagged rather than dropped:

if ("start_station_name" %in% names(all_trips)) {
  all_trips <- all_trips %>%
    mutate(
      start_station_name = ifelse(is.na(start_station_name) | start_station_name == "",
                                  "unknown_start", start_station_name),
      end_station_name = ifelse(is.na(end_station_name) | end_station_name == "",
                                "unknown_end", end_station_name)
    )
}

Similarly, for station_id

if ("start_station_id" %in% names(all_trips)) {
  all_trips <- all_trips %>%
    mutate(
      start_station_id = ifelse(is.na(start_station_id) | start_station_id == "",
                                "unknown_start_id", start_station_id),
      end_station_id = ifelse(is.na(end_station_id) | end_station_id == "",
                              "unknown_end_id", end_station_id)
    )
}

————————————–

9 Add useful variables to the analysis

————————————–

all_trips <- all_trips %>%
  mutate(
    date = as_date(started_at),
    year = year(started_at),
    month = month(started_at, label = TRUE, abbr = TRUE),
    day_of_week = wday(started_at, label = TRUE, abbr = TRUE),
    day_of_week_full = wday(started_at, label = TRUE, abbr = FALSE),
    hour = hour(started_at),
    is_weekend = day_of_week %in% c("Sat", "Sun"),
    season = case_when(
      month %in% c("Dec", "Jan", "Feb") ~ "Winter",
      month %in% c("Mar", "Apr", "May") ~ "Spring",
      month %in% c("Jun", "Jul", "Aug") ~ "Summer",
      TRUE ~ "Fall"
    )
  )

Bucket the duration to facilitate analysis (e.g., short < 10 min, medium 10–30 min, long ≥ 30 min)

all_trips <- all_trips %>%
  mutate(
    trip_length_bucket = case_when(
      ride_length_min < 10 ~ "short_<10 min",
      ride_length_min >= 10 & ride_length_min < 30 ~ "medium_10-30 min",
      ride_length_min >= 30 ~ "long_>=30 min",
      TRUE ~ "unknown"
    )
  )

Order the day-of-week factor levels logically

all_trips <- all_trips %>%
  mutate(day_of_week = factor(day_of_week, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
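Ordering the levels matters because sorting, tables, and plot axes follow factor level order rather than alphabetical order. A small self-contained illustration:

```r
# Without explicit levels, factors sort alphabetically (Fri, Mon, Sat, ...).
# With calendar-ordered levels, sort() respects the week order instead.
days <- factor(c("Sun", "Mon", "Sat"),
               levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
as.character(sort(days)) # "Mon" "Sat" "Sun" — calendar order
```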

——————————-

10. Some Final Sanity Checks

——————————-

Final Number of Rows/Columns

cat("\n📊 Final Dimensions: ", nrow(all_trips), "rows ×", ncol(all_trips), "columns\n")
## 
## 📊 Final Dimensions:  5832563 rows × 26 columns

Missing Values by Column

cat("\n🔍 Number of Null Values per Column (Top 20):\n")
## 
## 🔍 Number of Null Values per Column (Top 20):
na_summary <- all_trips %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  arrange(desc(n_missing))

print(head(na_summary, 20))
## # A tibble: 20 × 2
##    column             n_missing
##    <chr>                  <int>
##  1 end_lat                  759
##  2 end_lng                  759
##  3 ride_id                    0
##  4 rideable_type              0
##  5 started_at                 0
##  6 ended_at                   0
##  7 start_station_name         0
##  8 start_station_id           0
##  9 end_station_name           0
## 10 end_station_id             0
## 11 start_lat                  0
## 12 start_lng                  0
## 13 member_casual              0
## 14 start_at                   0
## 15 end_at                     0
## 16 ride_length_sec            0
## 17 ride_length_min            0
## 18 date                       0
## 19 year                       0
## 20 month                      0

Top 10 Starting Stations

if ("start_station_name" %in% names(all_trips)) {
  cat("\n🏁 Top 10 Start Stations (by trips):\n")
  print(all_trips %>% count(start_station_name, sort = TRUE) %>% slice_head(n = 10))
}
## 
## 🏁 Top 10 Start Stations (by trips):
## # A tibble: 10 × 2
##    start_station_name                      n
##    <chr>                               <int>
##  1 unknown_start                      827888
##  2 Streeter Dr & Grand Ave             74544
##  3 DuSable Lake Shore Dr & Monroe St   41189
##  4 Michigan Ave & Oak St               40182
##  5 DuSable Lake Shore Dr & North Blvd  40061
##  6 Wells St & Concord Ln               38411
##  7 Clark St & Elm St                   36263
##  8 Millennium Park                     34778
##  9 Kingsbury St & Kinzie St            34538
## 10 Theater on the Lake                 33238

Member vs Casual Distribution

cat("\n👥 Member vs Casual Distribution:\n")
## 
## 👥 Member vs Casual Distribution:
print(all_trips %>% count(member_casual) %>% mutate(pct = n / sum(n) * 100))
## # A tibble: 2 × 3
##   member_casual       n   pct
##   <chr>           <int> <dbl>
## 1 casual        2346869  40.2
## 2 member        3485694  59.8

————————————

11. Saving Clean Files and Summaries

———————————–

cleaned_csv <- file.path(cleaned_dir, "all_trips_cleaned.csv")
cleaned_rds <- file.path(cleaned_dir, "all_trips_cleaned.rds")
summary_file <- file.path(outputs_dir, "cleaning_summary.csv")

cat("\n💾 Save clean files...\n")
## 
## 💾 Save clean files...
write_csv(all_trips, cleaned_csv)
saveRDS(all_trips, cleaned_rds, compress = "xz") # xz-compressed RDS keeps the file small for later reloading

Saving a summary (quick metrics) to CSV

summary_table <- tibble(
  metric = c("n_rows", "n_cols", "n_unique_ride_id"),
  value = c(nrow(all_trips), ncol(all_trips),
            ifelse("ride_id" %in% names(all_trips), n_distinct(all_trips$ride_id), NA))
)

write_csv(summary_table, summary_file)

cat("✅ cleaned CSV, RDS and summary file saved to:\n",
    cleaned_csv, "\n", cleaned_rds, "\n", summary_file, "\n\n")
## ✅ cleaned CSV, RDS and summary file saved to:
##  C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.csv 
##  C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.rds 
##  C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/cleaning_summary.csv

——————————————————————

12. Also, save some useful summarization tables for quick analysis

—————————————————————–

Example: Number of trips by member_casual and day_of_week

trips_by_day <- all_trips %>%
  group_by(member_casual, day_of_week) %>%
  summarise(ride_count = n(), avg_duration_min = mean(ride_length_min, na.rm = TRUE), .groups = "drop")
write_csv(trips_by_day, file.path(outputs_dir, "trips_by_day_member_week.csv"))

Example: Monthly activity

monthly_trips <- all_trips %>%
  group_by(year, month, member_casual) %>%
  summarise(ride_count = n(), avg_duration_min = mean(ride_length_min, na.rm = TRUE), .groups = "drop")
write_csv(monthly_trips, file.path(outputs_dir, "monthly_trips_member.csv"))
cat("✅ Useful grouping tables have been saved inside:", outputs_dir, "\n\n")
## ✅ Useful grouping tables have been saved inside: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs

——————————————–

13. Conclusion and Move On to the Next Stage

——————————————-

cat("🎉 Data cleaning is complete! You now have a clean file ready for analysis, Exploratory Analysis (EDA), and Visualization.\n")
## 🎉 Data cleaning is complete! You now have a clean file ready for analysis, Exploratory Analysis (EDA), and Visualization.
cat("👉 Saved files:\n -", cleaned_csv, "\n -", cleaned_rds, "\n -", summary_file, "\n")
## 👉 Saved files:
##  - C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.csv 
##  - C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.rds 
##  - C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/cleaning_summary.csv
cat("\nSuggested next step: Run the Exploratory Analysis script `03_analysis.R` for detailed graphs and statistics.\n")
## 
## Suggested next step: Run the Exploratory Analysis script `03_analysis.R` for detailed graphs and statistics.

03 Analysis

——————————-

Libraries

——————————-

library(tidyverse)
library(lubridate)
library(readr)
library(janitor)
library(ggplot2)
library(scales)
library(here)

——————————-

1 Load the cleaned data

——————————-

main_dir <- here::here("C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy")
cleaned_file <- file.path(main_dir, "data", "cleaned_data", "all_trips_cleaned.csv")
visual_dir <- file.path(main_dir, "visualizations")

if (!dir.exists(visual_dir)) dir.create(visual_dir, recursive = TRUE)
cat("📂 Reading cleaned dataset from:", cleaned_file, "\n\n")
## 📂 Reading cleaned dataset from: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/cleaned_data/all_trips_cleaned.csv
all_trips <- read_csv(cleaned_file, show_col_types = FALSE)
cat("✅ Loaded:", nrow(all_trips), "rows ×", ncol(all_trips), "columns\n\n")
## ✅ Loaded: 5832563 rows × 26 columns

——————————-

2 General Usage Analysis

——————————-

usage_summary <- all_trips %>%
  group_by(member_casual) %>%
  summarise(
    total_trips = n(),
    avg_duration_min = mean(ride_length_min, na.rm = TRUE),
    median_duration = median(ride_length_min, na.rm = TRUE)
  )
print(usage_summary)
## # A tibble: 2 × 4
##   member_casual total_trips avg_duration_min median_duration
##   <chr>               <int>            <dbl>           <dbl>
## 1 casual            2346869             21.3            12.5
## 2 member            3485694             12.2             8.7

📊 User Comparison Plot

ggplot(usage_summary, aes(x = member_casual, y = avg_duration_min, fill = member_casual)) +
  geom_col(width = 0.5) +
  labs(title = "⏱️ Average Trip Duration by User Type",
       x = "User Type", y = "Average Duration (minutes)") +
  theme_minimal() +
  scale_fill_manual(values = c("#2E86AB", "#E74C3C"))

ggsave(file.path(visual_dir, "avg_trip_duration_by_member.png"), width = 7, height = 5)

——————————-

3. Time Analysis (Day, Hour, Month)

——————————-

trips_by_day <- all_trips %>%
  group_by(member_casual, day_of_week) %>%
  summarise(total_rides = n(), avg_duration = mean(ride_length_min), .groups = "drop")

ggplot(trips_by_day, aes(x = day_of_week, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "📆 Number of Trips by Day of the Week",
       x = "Day", y = "Number of Trips") +
  theme_minimal()

ggsave(file.path(visual_dir, "trips_by_day.png"), width = 7, height = 5)

⏰ Hourly Analysis

trips_by_hour <- all_trips %>%
  group_by(member_casual, hour) %>%
  summarise(total_rides = n(), .groups = "drop")

ggplot(trips_by_hour, aes(x = hour, y = total_rides, color = member_casual)) +
  geom_line(linewidth = 1.1) + # `size` for lines is deprecated in ggplot2 >= 3.4
  labs(title = "⏰ Trip Distribution by Hour",
       x = "Time of Day", y = "Number of Trips") +
  theme_minimal()

ggsave(file.path(visual_dir, "trips_by_hour.png"), width = 7, height = 5)

📅 Monthly Analysis

monthly_usage <- all_trips %>%
  group_by(year, month, member_casual) %>%
  summarise(total_rides = n(), .groups = "drop")

ggplot(monthly_usage, aes(x = month, y = total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  facet_wrap(~ year) +
  labs(title = "📅 Monthly Number of Trips by User Type",
       x = "Month", y = "Number of Trips") +
  theme_minimal()

ggsave(file.path(visual_dir, "monthly_usage.png"), width = 8, height = 5)

——————————-

4 Bike Type Analysis

——————————-

if ("rideable_type" %in% names(all_trips)) {
  ride_type_summary <- all_trips %>%
    group_by(member_casual, rideable_type) %>%
    summarise(total_rides = n(), avg_duration = mean(ride_length_min), .groups = "drop")

  # print() is needed for the plot to render inside an if block
  print(
    ggplot(ride_type_summary, aes(x = rideable_type, y = total_rides, fill = member_casual)) +
      geom_col(position = "dodge") +
      labs(title = "🚲 Bike Type Usage by User Type",
           x = "Bike Type", y = "Number of Trips") +
      theme_minimal()
  )
  ggsave(file.path(visual_dir, "ride_type_usage.png"), width = 7, height = 5)
}

——————————-

5 Extracting Insights

——————————-

cat("\n📊 Insights:\n")
## 
## 📊 Insights:
cat("1️⃣ Members ride mostly on weekdays, while casual riders ride mostly on weekends.\n")
## 1️⃣ Members ride mostly on weekdays, while casual riders ride mostly on weekends.
cat("2️⃣ The average trip duration for casual riders is much longer than for members.\n")
## 2️⃣ The average trip duration for casual riders is much longer than for members.
cat("3️⃣ Members often ride for work commutes, while casual riders ride for recreation.\n")
## 3️⃣ Members often ride for work commutes, while casual riders ride for recreation.
cat("4️⃣ Members use e-bikes more often → they prefer speed and efficiency.\n")
## 4️⃣ Members use e-bikes more often → they prefer speed and efficiency.

——————————-

6. Save a data summary for presentation in Tableau or Power BI

——————————-

insight_summary <- list(
  usage_summary = usage_summary,
  trips_by_day = trips_by_day,
  trips_by_hour = trips_by_hour,
  monthly_usage = monthly_usage
)
saveRDS(insight_summary, file.path(visual_dir, "eda_insights_summary.rds"))
cat("\n✅ All results and graphs are saved inside the visualizations folder.\n")
## 
## ✅ All results and graphs are saved inside the visualizations folder.
spatial_summary <- all_trips %>%
  # 1. Group by starting coordinates and member category
  group_by(start_lat, start_lng, member_casual) %>%
  # 2. Summarise: count the trips in each coordinate group
  summarize(ride_count = n(), .groups = "drop") %>%
  # 3. Remove invalid values (optional but recommended for mapping)
  na.omit() %>%
  # 4. Sort by number of trips in descending order
  arrange(desc(ride_count))

head(spatial_summary)
write.csv(spatial_summary, "spatial_summary_for_tableau.csv", row.names = FALSE)

spatial_summary_end <- all_trips %>%
  # 1. Group by ending coordinates and member category
  group_by(end_lat, end_lng, member_casual) %>%
  # 2. Summarise: count the trips in each coordinate group
  summarize(ride_count = n(), .groups = "drop") %>%
  # 3. Remove invalid values (optional but recommended for mapping)
  na.omit() %>%
  # 4. Sort by number of trips in descending order
  arrange(desc(ride_count))

head(spatial_summary_end)
write.csv(spatial_summary_end, "spatial_end_tableau.csv", row.names = FALSE)

—————————-

Script: 04_prepare_distance_table.R

Purpose: Compute trip distance (miles) and export Tableau-ready CSV(s)

—————————-

options(repos = c(CRAN = "https://cloud.r-project.org"))
if (!requireNamespace("geosphere", quietly = TRUE)) install.packages("geosphere")
library(geosphere)
library(tidyverse)
library(here)

Paths (set according to your project)

main_dir <- here::here("C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy")
cleaned_file <- file.path(main_dir, "data", "cleaned_data", "all_trips_cleaned.csv")
outputs_dir <- file.path(main_dir, "data", "outputs")
if (!dir.exists(outputs_dir)) dir.create(outputs_dir, recursive = TRUE)

# load cleaned data (Do not re-clean here)
all_trips <- read_csv(cleaned_file, show_col_types = FALSE)

# 1) Remove rows without valid coordinates, or with illogical all-zero values
clean_coords <- all_trips %>%
  filter(!is.na(start_lat) & !is.na(start_lng) & !is.na(end_lat) & !is.na(end_lng)) %>%
  filter(!(start_lat == 0 & start_lng == 0 & end_lat == 0 & end_lng == 0))

# 2) Compute the distance in metres, then convert to miles (1 metre = 0.000621371 miles)
# distHaversine expects coordinates as c(lon, lat)
compute_distance_miles <- function(slng, slat, elng, elat) {
  m <- geosphere::distHaversine(c(slng, slat), c(elng, elat)) # metres
  m * 0.000621371
}
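A quick sanity check on the metres-to-miles conversion used above, written as a pure-R haversine so it runs even without geosphere installed (the radius 6378137 m matches distHaversine's default):

```r
# Pure-R haversine distance in metres; one degree of latitude should come out
# near 69 miles after the 0.000621371 metres-to-miles conversion.
haversine_m <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(sqrt(a))
}

miles_per_degree <- haversine_m(-87.6, 41.9, -87.6, 42.9) * 0.000621371
round(miles_per_degree, 1) # ≈ 69.2 miles
```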

# row-wise computation (clear, but pmap calls the helper once per row)
distance_tbl <- clean_coords %>%
  transmute(
    ride_id,
    rideable_type,
    member_casual,
    ride_length_min,
    started_at,
    ended_at,
    start_lat, start_lng,
    end_lat, end_lng
  ) %>%
  mutate(
    distance_miles = pmap_dbl(list(start_lng, start_lat, end_lng, end_lat),
                              ~ compute_distance_miles(..1, ..2, ..3, ..4))
  )
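One performance note: `pmap_dbl` invokes the helper once per row, which is slow on 5.8 million trips. `geosphere::distHaversine` also accepts two-column (lon, lat) matrices, so the whole column can be computed in a single vectorised call. A sketch on toy data (same column names as `clean_coords`):

```r
library(dplyr)
library(geosphere)

# Toy rows with the same coordinate columns as clean_coords
demo <- tibble::tibble(
  start_lng = c(-87.63, -87.64), start_lat = c(41.88, 41.90),
  end_lng   = c(-87.63, -87.61), end_lat   = c(41.88, 41.88)
)

# One vectorised call instead of a per-row loop
demo <- demo %>%
  mutate(
    distance_miles = geosphere::distHaversine(
      cbind(start_lng, start_lat), # start points as a (lon, lat) matrix
      cbind(end_lng, end_lat)      # end points as a (lon, lat) matrix
    ) * 0.000621371                # meters -> miles
  )

demo$distance_miles[1] # -> 0 (identical start and end)
```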

# 3) Remove implausible values (optional)
# Example: any trip over 100 miles within the city is likely a GPS error -> label or delete it
distance_tbl <- distance_tbl %>%
  mutate(distance_miles = round(distance_miles, 3)) %>%
  filter(distance_miles <= 100) # adjust the cutoff as needed
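The comment above mentions labeling as an alternative to deleting; to keep suspect rows for later auditing, a flag column works. A minimal base-R sketch (the column name `likely_gps_error` and the toy data are my own):

```r
# Toy distances standing in for distance_tbl
distance_demo <- data.frame(distance_miles = c(0.8, 2.4, 250))

# Flag rather than filter: the row is kept but marked for review
distance_demo$likely_gps_error <- distance_demo$distance_miles > 100

sum(distance_demo$likely_gps_error) # -> 1 flagged trip
```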

# 4) Select the columns of the final table to load into Tableau
tableau_distance <- distance_tbl %>%
  select(ride_id, started_at, ended_at, ride_length_min, distance_miles,
         member_casual, rideable_type, start_lat, start_lng, end_lat, end_lng)

# 5) Save as one file (if the size is reasonable), otherwise split it into parts
out_file <- file.path(outputs_dir, "trips_with_distance.csv")
readr::write_csv(tableau_distance, out_file)
cat("✅ Saved full CSV to:", out_file, "\n")
## ✅ Saved full CSV to: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance.csv
# 6) Split the file into smaller parts when needed (e.g., 1,000,000 rows per file)
chunk_size <- 1000000
n <- nrow(tableau_distance)
if (n > chunk_size) {
  cat("⚠️ Large file detected (", n, " rows). Splitting into chunks of ", chunk_size, "...\n")
  num_parts <- ceiling(n / chunk_size)
  for (i in seq_len(num_parts)) {
    start_row <- (i - 1) * chunk_size + 1
    end_row <- min(i * chunk_size, n)
    part <- tableau_distance[start_row:end_row, ]
    out_part <- file.path(outputs_dir, paste0("trips_with_distance_part", sprintf("%02d", i), ".csv"))
    readr::write_csv(part, out_part)
    cat("Saved:", out_part, "rows:", start_row, "-", end_row, "\n")
  }
}
## ⚠️ Large file detected ( 5831796  rows). Splitting into chunks of  1e+06 ...
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part01.csv rows: 1 - 1e+06 
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part02.csv rows: 1000001 - 2e+06 
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part03.csv rows: 2000001 - 3e+06 
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part04.csv rows: 3000001 - 4e+06 
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part05.csv rows: 4000001 - 5e+06 
## Saved: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs/trips_with_distance_part06.csv rows: 5000001 - 5831796
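The chunk bookkeeping can be verified in isolation: with 5,831,796 rows and 1,000,000-row chunks, the arithmetic yields six parts, the last covering rows 5,000,001 through 5,831,796, matching the log above.

```r
# Standalone check of the chunk boundaries used in the splitting loop
chunk_size <- 1000000
n <- 5831796 # row count reported by the script

num_parts <- ceiling(n / chunk_size)
bounds <- t(sapply(seq_len(num_parts), function(i) {
  c(start = (i - 1) * chunk_size + 1, end = min(i * chunk_size, n))
}))

num_parts   # -> 6
bounds[6, ] # -> start 5000001, end 5831796
```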
# 7) Conclusion: print a quick overview
cat("\nPreview:\n")
## 
## Preview:
print(glimpse(tableau_distance))
## Rows: 5,831,796
## Columns: 11
## $ ride_id         <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83041…
## $ started_at      <dttm> 2022-05-23 23:06:58, 2022-05-11 08:53:28, 2022-05-26 …
## $ ended_at        <dttm> 2022-05-23 23:40:19, 2022-05-11 09:31:22, 2022-05-26 …
## $ ride_length_min <dbl> 33.350000, 37.900000, 21.833333, 8.700000, 5.016667, 7…
## $ distance_miles  <dbl> 3.787, 0.375, 2.391, 0.499, 0.568, 0.797, 0.505, 0.950…
## $ member_casual   <chr> "member", "member", "member", "member", "member", "mem…
## $ rideable_type   <chr> "classic_bike", "classic_bike", "classic_bike", "class…
## $ start_lat       <dbl> 41.89147, 41.88096, 41.88224, 41.88224, 41.88224, 41.8…
## $ start_lng       <dbl> -87.62676, -87.61674, -87.64107, -87.64107, -87.64107,…
## $ end_lat         <dbl> 41.94367, 41.88635, 41.90765, 41.88458, 41.88578, 41.8…
## $ end_lng         <dbl> -87.64895, -87.61752, -87.67255, -87.63189, -87.65102,…
## # A tibble: 5,831,796 × 11
##    ride_id          started_at          ended_at            ride_length_min
##    <chr>            <dttm>              <dttm>                        <dbl>
##  1 EC2DE40644C6B0F4 2022-05-23 23:06:58 2022-05-23 23:40:19           33.4 
##  2 1C31AD03897EE385 2022-05-11 08:53:28 2022-05-11 09:31:22           37.9 
##  3 1542FBEC830415CF 2022-05-26 18:36:28 2022-05-26 18:58:18           21.8 
##  4 6FF59852924528F8 2022-05-10 07:30:07 2022-05-10 07:38:49            8.7 
##  5 483C52CAAE12E3AC 2022-05-10 17:31:56 2022-05-10 17:36:57            5.02
##  6 C0A3AA5A614DCE01 2022-05-04 14:48:55 2022-05-04 14:56:04            7.15
##  7 F2AF43A242DF4555 2022-05-27 12:41:48 2022-05-27 12:50:41            8.88
##  8 377BE1F5F0E399CA 2022-05-29 19:19:24 2022-05-29 19:31:34           12.2 
##  9 B136E0C969773F59 2022-05-16 17:48:44 2022-05-16 18:05:26           16.7 
## 10 75F6A50A05E0AA18 2022-05-11 07:29:29 2022-05-11 07:30:57            1.47
## # ℹ 5,831,786 more rows
## # ℹ 7 more variables: distance_miles <dbl>, member_casual <chr>,
## #   rideable_type <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>
cat("\n✅ Done. Files ready in:", outputs_dir, "\n")
## 
## ✅ Done. Files ready in: C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/data/outputs
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
library(glue)
library(htmltools)

📊 Full Presentation

library(stringr)

# 📁 Folder Path Containing Images
img_dir <- "C:/Users/hamza/OneDrive/Bureau/DAWRA/08 Google Data Analytics Capstone/Cyclistic_CaseStudy/visualizations/raport sld google"

# 🖼️ Get All Images Starting with the Same Name
imgs <- list.files(img_dir, pattern = "^Cyclistic bik presentation.*\\.png$", full.names = TRUE)

# 🧠 Function to extract the slide number from the file name (or 0 if none)
extract_number <- function(path) {
  fname <- basename(path)
  num <- str_extract(fname, "\\((\\d+)\\)")
  ifelse(is.na(num), 0, as.numeric(gsub("[()]", "", num)))
}
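To see what the function returns, a quick demonstration (the function is repeated below so the sketch runs standalone; the file names are made up but follow the same `(n)` naming pattern):

```r
library(stringr)

extract_number <- function(path) {
  fname <- basename(path)
  num <- str_extract(fname, "\\((\\d+)\\)") # e.g. "(12)"
  ifelse(is.na(num), 0, as.numeric(gsub("[()]", "", num)))
}

extract_number("slides/Cyclistic bik presentation (12).png") # -> 12
extract_number("slides/Cyclistic bik presentation.png")      # -> 0 (no number)
```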

# 🧮 Sort images by number
imgs <- imgs[order(sapply(imgs, extract_number))]

# 🖼️ Insert images sequentially into the report
for (img in imgs) {
  cat(sprintf('![](%s){width=100%%}\n\n', img))
}

#### ==========================================================
# End of Script
#### ==========================================================